Fault Tolerance and Resilience: Meanings, Measures and Assessment
Abstract
To assess the “resilience” of systems in quantitative terms, it is necessary to ask first what is meant by “resilience”, whether it is a single attribute or several, and which measure or measures appropriately characterise it. This chapter covers: the technical meanings that the word “resilience” has assumed, and its role in the debates about how best to achieve reliability, safety, etc.; the different possible measures for the attributes that the word designates, with their pros and cons in terms of ease of empirical assessment and suitability for supporting prediction and decision making; and the similarity between these concepts, measures and attached problems in various fields of engineering, and how lessons can be propagated between them.
Similar resources
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
Application-Level Resilience Modeling for HPC Fault Tolerance
Understanding application resilience in the presence of faults is critical to addressing the HPC resilience challenge. Currently we largely rely on random fault injection (RFI) to quantify application resilience. However, RFI provides little information on how fault tolerance happens, and RFI results are often not deterministic due to its random nature. In this paper, we introduce a new meth...
Modelling Resilience of Data Processing Capabilities of CPS
Modern CPS should process large amounts of data at high speed and with high reliability. To ensure that the system can handle varying volumes of data, system designers usually rely on architectures with a dynamically scaling degree of parallelism. However, to guarantee resilience of data processing, we should also ensure system fault tolerance, i.e., integrate the mechanisms for dynamic reconf...
Using Performance Tools to Support Experiments in HPC Resilience
The high-performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has also recognized that tools for resilience...
Restricted connectivity for three families of interconnection networks
Vertex connectivity and edge connectivity are two important parameters in interconnection networks. Even though they reflect the fault tolerance correctly, they undervalue the resilience of large networks. Building on the concepts of conditional connectivity and super-connectivity, the concepts of restricted vertex connectivity and restricted edge connectivity of graphs were proposed by Esfahanian [A.H. Es...